This time, we will use Prosper loan data to do the following basic analysis, what we want to find is that what factors will impact the APR (Annuall Percentage Rate) and build one prediction model. Meanwhile, we want to let Borrower know how they can reduce their BorrowerAPR.
Show the variables’ basic meaning
| Variable | Description |
|---|---|
| ListingKey | Unique key for each listing, same value as the ‘key’ used in the listing object in the API. |
| ListingNumber | The number that uniquely identifies the listing to the public as displayed on the website. |
| ListingCreationDate | The date the listing was created. |
| CreditGrade | The Credit rating that was assigned at the time the listing went live. Applicable for listings pre-2009 period and will only be populated for those listings. |
| Term | The length of the loan expressed in months. |
| LoanStatus | The current status of the loan: Cancelled, Chargedoff, Completed, Current, Defaulted, FinalPaymentInProgress, PastDue. The PastDue status will be accompanied by a delinquency bucket. |
| ClosedDate | Closed date is applicable for Cancelled, Completed, Chargedoff and Defaulted loan statuses. |
| BorrowerAPR | The Borrower’s Annual Percentage Rate (APR) for the loan. |
| BorrowerRate | The Borrower’s interest rate for this loan. |
| LenderYield | The Lender yield on the loan. Lender yield is equal to the interest rate on the loan less the servicing fee. |
| EstimatedEffectiveYield | Effective yield is equal to the borrower interest rate (i) minus the servicing fee rate, (ii) minus estimated uncollected interest on charge-offs, (iii) plus estimated collected late fees. Applicable for loans originated after July 2009. |
| EstimatedLoss | Estimated loss is the estimated principal loss on charge-offs. Applicable for loans originated after July 2009. |
| EstimatedReturn | The estimated return assigned to the listing at the time it was created. Estimated return is the difference between the Estimated Effective Yield and the Estimated Loss Rate. Applicable for loans originated after July 2009. |
| ProsperRating (numeric) | The Prosper Rating assigned at the time the listing was created: 0 - N/A, 1 - HR, 2 - E, 3 - D, 4 - C, 5 - B, 6 - A, 7 - AA. Applicable for loans originated after July 2009. |
| ProsperRating (Alpha) | The Prosper Rating assigned at the time the listing was created between AA - HR. Applicable for loans originated after July 2009. |
| ProsperScore | A custom risk score built using historical Prosper data. The score ranges from 1-10, with 10 being the best, or lowest risk score. Applicable for loans originated after July 2009. |
| ListingCategory | The category of the listing that the borrower selected when posting their listing: 0 - Not Available, 1 - Debt Consolidation, 2 - Home Improvement, 3 - Business, 4 - Personal Loan, 5 - Student Use, 6 - Auto, 7- Other, 8 - Baby&Adoption, 9 - Boat, 10 - Cosmetic Procedure, 11 - Engagement Ring, 12 - Green Loans, 13 - Household Expenses, 14 - Large Purchases, 15 - Medical/Dental, 16 - Motorcycle, 17 - RV, 18 - Taxes, 19 - Vacation, 20 - Wedding Loans |
| BorrowerState | The two letter abbreviation of the state of the address of the borrower at the time the Listing was created. |
| Occupation | The Occupation selected by the Borrower at the time they created the listing. |
| EmploymentStatus | The employment status of the borrower at the time they posted the listing. |
| EmploymentStatusDuration | The length in months of the employment status at the time the listing was created. |
| IsBorrowerHomeowner | A Borrower will be classified as a homowner if they have a mortgage on their credit profile or provide documentation confirming they are a homeowner. |
| CurrentlyInGroup | Specifies whether or not the Borrower was in a group at the time the listing was created. |
| GroupKey | The Key of the group in which the Borrower is a member of. Value will be null if the borrower does not have a group affiliation. |
| DateCreditPulled | The date the credit profile was pulled. |
| CreditScoreRangeLower | The lower value representing the range of the borrower’s credit score as provided by a consumer credit rating agency. |
| CreditScoreRangeUpper | The upper value representing the range of the borrower’s credit score as provided by a consumer credit rating agency. |
| FirstRecordedCreditLine | The date the first credit line was opened. |
| CurrentCreditLines | Number of current credit lines at the time the credit profile was pulled. |
| OpenCreditLines | Number of open credit lines at the time the credit profile was pulled. |
| TotalCreditLinespast7years | Number of credit lines in the past seven years at the time the credit profile was pulled. |
| OpenRevolvingAccounts | Number of open revolving accounts at the time the credit profile was pulled. |
| OpenRevolvingMonthlyPayment | Monthly payment on revolving accounts at the time the credit profile was pulled. |
| InquiriesLast6Months | Number of inquiries in the past six months at the time the credit profile was pulled. |
| TotalInquiries | Total number of inquiries at the time the credit profile was pulled. |
| CurrentDelinquencies | Number of accounts delinquent at the time the credit profile was pulled. |
| AmountDelinquent | Dollars delinquent at the time the credit profile was pulled. |
| DelinquenciesLast7Years | Number of delinquencies in the past 7 years at the time the credit profile was pulled. |
| PublicRecordsLast10Years | Number of public records in the past 10 years at the time the credit profile was pulled. |
| PublicRecordsLast12Months | Number of public records in the past 12 months at the time the credit profile was pulled. |
| RevolvingCreditBalance | Dollars of revolving credit at the time the credit profile was pulled. |
| BankcardUtilization | The percentage of available revolving credit that is utilized at the time the credit profile was pulled. |
| AvailableBankcardCredit | The total available credit via bank card at the time the credit profile was pulled. |
| TotalTrades | Number of trade lines ever opened at the time the credit profile was pulled. |
| TradesNeverDelinquent | Number of trades that have never been delinquent at the time the credit profile was pulled. |
| TradesOpenedLast6Months | Number of trades opened in the last 6 months at the time the credit profile was pulled. |
| DebtToIncomeRatio | The debt to income ratio of the borrower at the time the credit profile was pulled. This value is Null if the debt to income ratio is not available. This value is capped at 10.01 (any debt to income ratio larger than 1000% will be returned as 1001%). |
| IncomeRange | The income range of the borrower at the time the listing was created. |
| IncomeVerifiable | The borrower indicated they have the required documentation to support their income. |
| StatedMonthlyIncome | The monthly income the borrower stated at the time the listing was created. |
| LoanKey | Unique key for each loan. This is the same key that is used in the API. |
| TotalProsperLoans | Number of Prosper loans the borrower at the time they created this listing. This value will be null if the borrower had no prior loans. |
| TotalProsperPaymentsBilled | Number of on time payments the borrower made on Prosper loans at the time they created this listing. This value will be null if the borrower had no prior loans. |
| OnTimeProsperPayments | Number of on time payments the borrower had made on Prosper loans at the time they created this listing. This value will be null if the borrower has no prior loans. |
| ProsperPaymentsLessThanOneMonthLate | Number of payments the borrower made on Prosper loans that were less than one month late at the time they created this listing. This value will be null if the borrower had no prior loans. |
| ProsperPaymentsOneMonthPlusLate | Number of payments the borrower made on Prosper loans that were greater than one month late at the time they created this listing. This value will be null if the borrower had no prior loans. |
| ProsperPrincipalBorrowed | Total principal borrowed on Prosper loans at the time the listing was created. This value will be null if the borrower had no prior loans. |
| ProsperPrincipalOutstanding | Principal outstanding on Prosper loans at the time the listing was created. This value will be null if the borrower had no prior loans. |
| ScorexChangeAtTimeOfListing | Borrower’s credit score change at the time the credit profile was pulled. This will be the change relative to the borrower’s last Prosper loan. This value will be null if the borrower had no prior loans. |
| LoanCurrentDaysDelinquent | The number of days delinquent. |
| LoanFirstDefaultedCycleNumber | The cycle the loan was charged off. If the loan has not charged off the value will be null. |
| LoanMonthsSinceOrigination | Number of months since the loan originated. |
| LoanNumber | Unique numeric value associated with the loan. |
| LoanOriginalAmount | The origination amount of the loan. |
| LoanOriginationDate | The date the loan was originated. |
| LoanOriginationQuarter | The quarter in which the loan was originated. |
| MemberKey | The unique key that is associated with the borrower. This is the same identifier that is used in the API member object. |
| MonthlyLoanPayment | The scheduled monthly loan payment. |
| LP_CustomerPayments | Pre charge-off cumulative gross payments made by the borrower on the loan. If the loan has charged off, this value will exclude any recoveries. |
| LP_CustomerPrincipalPayments | Pre charge-off cumulative principal payments made by the borrower on the loan. If the loan has charged off, this value will exclude any recoveries. |
| LP_InterestandFees | Pre charge-off cumulative interest and fees paid by the borrower. If the loan has charged off, this value will exclude any recoveries. |
| LP_ServiceFees | Cumulative service fees paid by the investors who have invested in the loan. |
| LP_CollectionFees | Cumulative collection fees paid by the investors who have invested in the loan. |
| LP_GrossPrincipalLoss | The gross charged off amount of the loan. |
| LP_NetPrincipalLoss | The principal that remains uncollected after any recoveries. |
| LP_NonPrincipalRecoverypayments | The interest and fee component of any recovery payments. The current payment policy applies payments in the following order: Fees, interest, principal. |
| PercentFunded | Percent the listing was funded. |
| Recommendations | Number of recommendations the borrower had at the time the listing was created. |
| InvestmentFromFriendsCount | Number of friends that made an investment in the loan. |
| InvestmentFromFriendsAmount | Dollar amount of investments that were made by friends. |
| Investors | The number of investors that funded the loan. |
Show the structure of the data set.
## 'data.frame': 113937 obs. of 81 variables:
## $ ListingKey : Factor w/ 113066 levels "00003546482094282EF90E5",..: 7180 7193 6647 6669 6686 6689 6699 6706 6687 6687 ...
## $ ListingNumber : int 193129 1209647 81716 658116 909464 1074836 750899 768193 1023355 1023355 ...
## $ ListingCreationDate : Factor w/ 113064 levels "2005-11-09 20:44:28.847000000",..: 14184 111894 6429 64760 85967 100310 72556 74019 97834 97834 ...
## $ CreditGrade : Factor w/ 9 levels "","A","AA","B",..: 5 1 8 1 1 1 1 1 1 1 ...
## $ Term : int 36 36 36 36 36 60 36 36 36 36 ...
## $ LoanStatus : Factor w/ 12 levels "Cancelled","Chargedoff",..: 3 4 3 4 4 4 4 4 4 4 ...
## $ ClosedDate : Factor w/ 2803 levels "","2005-11-25 00:00:00",..: 1138 1 1263 1 1 1 1 1 1 1 ...
## $ BorrowerAPR : num 0.165 0.12 0.283 0.125 0.246 ...
## $ BorrowerRate : num 0.158 0.092 0.275 0.0974 0.2085 ...
## $ LenderYield : num 0.138 0.082 0.24 0.0874 0.1985 ...
## $ EstimatedEffectiveYield : num NA 0.0796 NA 0.0849 0.1832 ...
## $ EstimatedLoss : num NA 0.0249 NA 0.0249 0.0925 ...
## $ EstimatedReturn : num NA 0.0547 NA 0.06 0.0907 ...
## $ ProsperRating..numeric. : int NA 6 NA 6 3 5 2 4 7 7 ...
## $ ProsperRating..Alpha. : Factor w/ 8 levels "","A","AA","B",..: 1 2 1 2 6 4 7 5 3 3 ...
## $ ProsperScore : num NA 7 NA 9 4 10 2 4 9 11 ...
## $ ListingCategory..numeric. : int 0 2 0 16 2 1 1 2 7 7 ...
## $ BorrowerState : Factor w/ 52 levels "","AK","AL","AR",..: 7 7 12 12 25 34 18 6 16 16 ...
## $ Occupation : Factor w/ 68 levels "","Accountant/CPA",..: 37 43 37 52 21 43 50 29 24 24 ...
## $ EmploymentStatus : Factor w/ 9 levels "","Employed",..: 9 2 4 2 2 2 2 2 2 2 ...
## $ EmploymentStatusDuration : int 2 44 NA 113 44 82 172 103 269 269 ...
## $ IsBorrowerHomeowner : Factor w/ 2 levels "False","True": 2 1 1 2 2 2 1 1 2 2 ...
## $ CurrentlyInGroup : Factor w/ 2 levels "False","True": 2 1 2 1 1 1 1 1 1 1 ...
## $ GroupKey : Factor w/ 707 levels "","00343376901312423168731",..: 1 1 335 1 1 1 1 1 1 1 ...
## $ DateCreditPulled : Factor w/ 112992 levels "2005-11-09 00:30:04.487000000",..: 14347 111883 6446 64724 85857 100382 72500 73937 97888 97888 ...
## $ CreditScoreRangeLower : int 640 680 480 800 680 740 680 700 820 820 ...
## $ CreditScoreRangeUpper : int 659 699 499 819 699 759 699 719 839 839 ...
## $ FirstRecordedCreditLine : Factor w/ 11586 levels "","1947-08-24 00:00:00",..: 8639 6617 8927 2247 9498 497 8265 7685 5543 5543 ...
## $ CurrentCreditLines : int 5 14 NA 5 19 21 10 6 17 17 ...
## $ OpenCreditLines : int 4 14 NA 5 19 17 7 6 16 16 ...
## $ TotalCreditLinespast7years : int 12 29 3 29 49 49 20 10 32 32 ...
## $ OpenRevolvingAccounts : int 1 13 0 7 6 13 6 5 12 12 ...
## $ OpenRevolvingMonthlyPayment : num 24 389 0 115 220 1410 214 101 219 219 ...
## $ InquiriesLast6Months : int 3 3 0 0 1 0 0 3 1 1 ...
## $ TotalInquiries : num 3 5 1 1 9 2 0 16 6 6 ...
## $ CurrentDelinquencies : int 2 0 1 4 0 0 0 0 0 0 ...
## $ AmountDelinquent : num 472 0 NA 10056 0 ...
## $ DelinquenciesLast7Years : int 4 0 0 14 0 0 0 0 0 0 ...
## $ PublicRecordsLast10Years : int 0 1 0 0 0 0 0 1 0 0 ...
## $ PublicRecordsLast12Months : int 0 0 NA 0 0 0 0 0 0 0 ...
## $ RevolvingCreditBalance : num 0 3989 NA 1444 6193 ...
## $ BankcardUtilization : num 0 0.21 NA 0.04 0.81 0.39 0.72 0.13 0.11 0.11 ...
## $ AvailableBankcardCredit : num 1500 10266 NA 30754 695 ...
## $ TotalTrades : num 11 29 NA 26 39 47 16 10 29 29 ...
## $ TradesNeverDelinquent..percentage. : num 0.81 1 NA 0.76 0.95 1 0.68 0.8 1 1 ...
## $ TradesOpenedLast6Months : num 0 2 NA 0 2 0 0 0 1 1 ...
## $ DebtToIncomeRatio : num 0.17 0.18 0.06 0.15 0.26 0.36 0.27 0.24 0.25 0.25 ...
## $ IncomeRange : Factor w/ 8 levels "$0","$1-24,999",..: 4 5 7 4 3 3 4 4 4 4 ...
## $ IncomeVerifiable : Factor w/ 2 levels "False","True": 2 2 2 2 2 2 2 2 2 2 ...
## $ StatedMonthlyIncome : num 3083 6125 2083 2875 9583 ...
## $ LoanKey : Factor w/ 113066 levels "00003683605746079487FF7",..: 100337 69837 46303 70776 71387 86505 91250 5425 908 908 ...
## $ TotalProsperLoans : int NA NA NA NA 1 NA NA NA NA NA ...
## $ TotalProsperPaymentsBilled : int NA NA NA NA 11 NA NA NA NA NA ...
## $ OnTimeProsperPayments : int NA NA NA NA 11 NA NA NA NA NA ...
## $ ProsperPaymentsLessThanOneMonthLate: int NA NA NA NA 0 NA NA NA NA NA ...
## $ ProsperPaymentsOneMonthPlusLate : int NA NA NA NA 0 NA NA NA NA NA ...
## $ ProsperPrincipalBorrowed : num NA NA NA NA 11000 NA NA NA NA NA ...
## $ ProsperPrincipalOutstanding : num NA NA NA NA 9948 ...
## $ ScorexChangeAtTimeOfListing : int NA NA NA NA NA NA NA NA NA NA ...
## $ LoanCurrentDaysDelinquent : int 0 0 0 0 0 0 0 0 0 0 ...
## $ LoanFirstDefaultedCycleNumber : int NA NA NA NA NA NA NA NA NA NA ...
## $ LoanMonthsSinceOrigination : int 78 0 86 16 6 3 11 10 3 3 ...
## $ LoanNumber : int 19141 134815 6466 77296 102670 123257 88353 90051 121268 121268 ...
## $ LoanOriginalAmount : int 9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
## $ LoanOriginationDate : Factor w/ 1873 levels "2005-11-15 00:00:00",..: 426 1866 260 1535 1757 1821 1649 1666 1813 1813 ...
## $ LoanOriginationQuarter : Factor w/ 33 levels "Q1 2006","Q1 2007",..: 18 8 2 32 24 33 16 16 33 33 ...
## $ MemberKey : Factor w/ 90831 levels "00003397697413387CAF966",..: 11071 10302 33781 54939 19465 48037 60448 40951 26129 26129 ...
## $ MonthlyLoanPayment : num 330 319 123 321 564 ...
## $ LP_CustomerPayments : num 11396 0 4187 5143 2820 ...
## $ LP_CustomerPrincipalPayments : num 9425 0 3001 4091 1563 ...
## $ LP_InterestandFees : num 1971 0 1186 1052 1257 ...
## $ LP_ServiceFees : num -133.2 0 -24.2 -108 -60.3 ...
## $ LP_CollectionFees : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LP_GrossPrincipalLoss : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LP_NetPrincipalLoss : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LP_NonPrincipalRecoverypayments : num 0 0 0 0 0 0 0 0 0 0 ...
## $ PercentFunded : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Recommendations : int 0 0 0 0 0 0 0 0 0 0 ...
## $ InvestmentFromFriendsCount : int 0 0 0 0 0 0 0 0 0 0 ...
## $ InvestmentFromFriendsAmount : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Investors : int 258 1 41 158 20 1 1 1 1 1 ...
Wow, 81 variables, since not familar with loan data features, need to understand them with the following exploration and web searching. 113937 obs, not small.
After checking the summary of the data set, why so many NA values? Part of them have the same NA count, what causes the NA value? Confused. The information should be calculated automatically, e.g., EstimatedEffectiveYield.
Why duplicate ListingKey? Subset the duplicate ListingKey data for example.
| ListingKey | ProsperScore | |
|---|---|---|
| 13079 | 17A93590655669644DB4C06 | 4 |
| 14889 | 17A93590655669644DB4C06 | 8 |
| 20570 | 17A93590655669644DB4C06 | 7 |
| 31451 | 17A93590655669644DB4C06 | 10 |
| 42751 | 17A93590655669644DB4C06 | 5 |
| 42752 | 17A93590655669644DB4C06 | 6 |
The only difference is ProsperScore, how will cause the ProsperScore to change? not understood. So, for each loan data, if prosperScore changes, will be saved several times?
The main objective for this article is to find what factors will impact the BorrowerAPR, so we want to know the BorrowerAPR distribution firstly.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00653 0.15629 0.20976 0.21883 0.28381 0.51229 25
The most frequent BorrowerAPR should still be around 0.2. Another peak is around 0.36
Then we want to explore each variable one by one
## AA A B C D E NA's
## 85125 3509 3315 4389 5649 5153 3289 3508
What’s the meaning of ‘NC’? not correct? remove ‘NC’. C Credit Grade is with big probabiltiy, lots of loan records have no CreditGrade information, this is reasonable, because CreditGrade is used for assessing the loan before 2009 July. After 2009 July, we will use ProsperRating for each loan. As we know, CreditGrade or ProsperRating should be one import factor that impacts the APR. There’s no “HR” level in CreditGrade, is ‘NA’ ‘HR’ level?
Something wrong here, NA value should be ‘HR’, change NA to HR.
## AA A B C D E HR
## 85125 3509 3315 4389 5649 5153 3289 3508
## 12 36 60
## 1614 87778 24545
Doubt that may Term 12 has been canceled in the latest prosper loan, however, from the data, the creation time is not old, so this thought is wrong.
There are just three values for Term variable, the most frequent one is 36, three years.
## Cancelled Chargedoff Completed
## 5 11992 38074
## Current Defaulted FinalPaymentInProgress
## 56576 5018 205
## Past Due (>120 days) Past Due (1-15 days) Past Due (16-30 days)
## 16 806 265
## Past Due (31-60 days) Past Due (61-90 days) Past Due (91-120 days)
## 363 313 304
This LoanStatus is not the feature we care about for BorrowerAPR prediction, however, this one may can be used for predicting what kind of loan will be charged-off. This status WOW me, the probabiltiy for defaulted and charged-off is not small.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.1340 0.1840 0.1928 0.2500 0.4975
This feature is highly related with BorrowerAPR, BorrowerAPR = BorrowerRate + OrganizationFee. Will check whether organizationFee changes with Credit Grade or not, from the introduction is Prosper company, seems yes.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.000 3.000 4.000 4.072 5.000 7.000 29084
Numerice and alpha value describe the same thing, so can just keep one
## AA A B C D E NA's
## 29084 5372 14551 15581 18345 14274 9795 6935
Something wrong here, NA value should be ‘HR’, change NA to HR.
## AA A B C D E HR
## 29084 5372 14551 15581 18345 14274 9795 6935
As we talked before, combine CreditGrade and ProsperRating two columns to one column CreditRating that can describe the credit value.
## A AA B C D E HR
## 272 17866 8881 19970 23994 19427 13084 10443
Still have 131 loans that are with no CreditRating information.
## AA A B C D E HR
## 272 8881 17866 19970 23994 19427 13084 10443
All the credit information is combined. best -> worst, ‘AA’ -> ‘HR’.
## 1 2 3 4 5 6 7 8 9 10 11 NA's
## 992 5766 7642 12595 9813 12278 10597 12053 6911 4750 1456 29084
why there’s 11? in data decription file, 10 should be the highest value. What ever, best -> worst, 11 -> 1.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 1.000 1.000 2.774 3.000 20.000
The biggest probability is used for Debt consolication.
Did not understand the meaning of the two letter abbreviation.
The occupation should not be one key feature. Most borrowers choose the professional.
## Employed Full-time Not available Not employed
## 2255 67322 26355 5347 835
## Other Part-time Retired Self-employed
## 3806 1088 795 6134
The classes of EmploymentStatus are weired, employed, not employed, what’s the relationshipe between employed and Full-time, Part-time. Want to combine the levels to just two, employed and not employed.
## Employed Not employed
## 113102 835
Most borrowers are employed.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 26.00 67.00 96.07 137.00 755.00 7625
This is one long tail variable.
## False True
## 56459 57478
False and True are nearly 50% and 50%.
CreditScoreRangeLower and Upper should be combined into one range column, like income range.
Most Borrowers credit score in range 640 - 740. The uppper value = lower value + 19, so we can just keep one for the next revision.
Try to build a new feature to reduce the CreditScoreRange level so check whether will improve the relationship
Add one variable to judge the length of credit history, the longer history, should the lower APR.
The unit is day.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 4.00 6.00 6.97 9.00 51.00
The most frequent count is around 5.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.000 1.000 1.435 2.000 105.000 697
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 2.000 4.000 5.584 7.000 379.000 1159
Most InquiriesLast6Months are smaller than 2, outliers can be up to around 100. Most TotalInquiries are smaller than 10, some of outliers can be up to around 300.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0000 0.0000 0.0000 0.5921 0.0000 83.0000 697
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0 0.0 0.0 984.5 0.0 463881.0 7622
75% CurrentDelinquencies is 0, most of CurrentDelinquence is smaller than 10, the outlier can be up to 80. The most frequent amountDelinquent not 0 is around 1000.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.000 0.000 4.155 3.000 99.000 990
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.820 0.940 0.886 1.000 1.000 7544
The variance of TradesNeverDelinquent..percentage. is not so small, except the 25% borrowers, the other ones all have ever missed the repayment time. However, for DelinquenciesLast7Years, 50% borrowers have 0 delinquence.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.000 0.000 0.015 0.000 20.000 7604
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0000 0.0000 0.0000 0.3126 0.0000 38.0000 697
For this type of variable, actually, want to check, if ignore 0 value, will increase the relationshipe between it and BorrowerAPR?
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0 3121 8549 17599 19521 1435667 7604
The most frequent value is around 1e+04
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.310 0.600 0.561 0.840 5.950 7604
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0 880 4100 11210 13180 646285 7544
Most BankcardUtilization is smaller than 1, the most frequent AvailableBankcardCredit is around 10000.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.140 0.220 0.276 0.320 10.010 8554
The most frequent one is around 0.2.
What’s the meaning of ‘not displayed’, what’s the difference between ‘not employed’ and ‘$0’?
## False True
## 8669 105268
Most borrowers are IncomeVerifiable
There are 113937 obs. of 81 variables in this dataset. Each observation is one loan record. The dataset is collected from Prosper webBank, who is America’s frist peer-to-peer lending marketplace. Borrowers request personal loans and inverstor fund. Knowing more background will make us understand the data more easily and clearly.
Since this is one loan dataset, we care most is the BorrowerAPR. We want to known what factors will impact the BorrowerAPR and build one prediction model. The basic one should be CreditRating, however, some combination of other variables should also be used to build the prediction model.
investigation into your feature(s) of interest?
In my view, the features may imapct the BorrowerAPR includes at least: ProsperScore, EmploymentStatus, EmploymentFlag, IsBorrowerHomeowener, CreditScoreRange, OpenRevolvingAccounts, InquireisLast6Months, CurrentDelinquencies, AmountDelinquent, DelinquenciesLast7Years, PublicRecordsLast10Years, PublicRecordsLast12Months, BankcardUtilization, AvailableBankcardCredit, TradesNeverDelinquent, DebtToIncomeRatio, IncomeRange, IncomeVerifiable, LengthHistory etc.
After lots of research, find that the five important components for credit score, i.e., payment history, credit utilization, length of credit history, new credit and credit mix. The features in data set almost all have relationship with the five components. However, except CreditRating, ProsperScore, CreditScoreRange, which already take acount of the whole history information, Delinquency part, PublicRecords part, BankcardUtilization, AvailableBankcardCredit, DebtToIncomeRatio, LengthHistory should be the most related features.
Since CreditGrade is used for period before July, 2009 and ProsperRating is used fo period after July, 2009, i combine them to one CreditRating feature.
Create a variable EmploymentFlag to classify the emopolyment status to employed and not employed, will check more in the Bivariate Plots Section.
Create a variable CreditScoreRange to describe CrediteScoreRangeLower and Upper more directly. And then create a variable CreditRevision to reduce the levels.
Create a variable LengthHistory to judge the length of credit history.
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?
GreditGrade and ProsperRating miss ‘HR’ level, change the NA value to ‘HR’, since ProsperRating(numeric) use 1 for ‘HR’.
Change Term, ProsperScore to factor, since just limited levels.
Change FirstRecordedCreditLine and ListingCreationDate to Datetime
Select the variables we most care about to get the subset to calculate the relationship between each other
Correlation matrix
Select the relationship: Very Strong (0.8 - 1), Strong (0.6 - 0.8), Moderate (0.4 - 0.6), Weak (0.2 - 0.4)
| Var1 | Var2 | value |
|---|---|---|
| BorrowerRate | BorrowerAPR | 0.9898240 |
| TotalTrades | TotalCreditLinespast7years | 0.9364824 |
| CreditScoreRevision | CreditScoreRange | 0.9187899 |
| CreditRating | BorrowerRate | 0.8725063 |
| CreditRating | BorrowerAPR | 0.8697732 |
| TotalInquiries | InquiriesLast6Months | 0.7419499 |
| ProsperScore | CreditRating | -0.7052214 |
| ProsperScore | BorrowerAPR | -0.6682872 |
| ProsperScore | BorrowerRate | -0.6497361 |
| CreditScoreRange | CreditRating | -0.6361950 |
| CreditScoreRevision | CreditRating | -0.6178868 |
| IncomeVerifiable | DebtToIncomeRatio | -0.6005166 |
| CreditScoreRevision | AvailableBankcardCredit | 0.4718884 |
| TradesNeverDelinquent..percentage. | CreditScoreRange | 0.4686689 |
| AvailableBankcardCredit | CreditScoreRange | 0.4536941 |
| BankcardUtilization | CreditScoreRange | -0.4053380 |
| CreditScoreRevision | BankcardUtilization | -0.4193442 |
| CreditScoreRange | BorrowerAPR | -0.4513276 |
| TradesNeverDelinquent..percentage. | CurrentDelinquencies | -0.4587606 |
| CreditScoreRevision | BorrowerAPR | -0.4789308 |
| CreditScoreRange | BorrowerRate | -0.4834341 |
| CreditScoreRevision | BorrowerRate | -0.4944032 |
| TradesNeverDelinquent..percentage. | DelinquenciesLast7Years | -0.5164432 |
| LengthHistory | TotalTrades | 0.3981152 |
| CreditScoreRevision | TradesNeverDelinquent..percentage. | 0.3969301 |
| AvailableBankcardCredit | CreditRating | -0.3880894 |
| DelinquenciesLast7Years | CurrentDelinquencies | 0.3777769 |
| CurrentDelinquencies | CreditScoreRange | -0.3738819 |
| CreditScoreRange | ProsperScore | 0.3696030 |
| LengthHistory | TotalCreditLinespast7years | 0.3678176 |
| CreditScoreRevision | ProsperScore | 0.3587978 |
| AvailableBankcardCredit | BankcardUtilization | -0.3508306 |
| AvailableBankcardCredit | BorrowerAPR | -0.3489261 |
| AvailableBankcardCredit | BorrowerRate | -0.3438611 |
| AmountDelinquent | CurrentDelinquencies | 0.3405485 |
| AvailableBankcardCredit | ProsperScore | 0.3185580 |
| TotalTrades | IsBorrowerHomeowner | 0.3174058 |
| TradesNeverDelinquent..percentage. | CreditRating | -0.3157106 |
| CreditScoreRange | IsBorrowerHomeowner | 0.3023572 |
| InquiriesLast6Months | ProsperScore | -0.2967619 |
| BankcardUtilization | CreditRating | 0.2967502 |
| CreditScoreRevision | IsBorrowerHomeowner | 0.2949463 |
| TotalCreditLinespast7years | IsBorrowerHomeowner | 0.2935867 |
| TotalInquiries | CreditScoreRange | -0.2932329 |
| InquiriesLast6Months | CreditScoreRange | -0.2693860 |
| BankcardUtilization | BorrowerAPR | 0.2614380 |
| TradesNeverDelinquent..percentage. | BorrowerRate | -0.2611895 |
| DelinquenciesLast7Years | CreditScoreRange | -0.2598307 |
| BankcardUtilization | BorrowerRate | 0.2554820 |
| IncomeVerifiable | EmploymentFlag | -0.2551694 |
| TotalTrades | AvailableBankcardCredit | 0.2499171 |
| BankcardUtilization | ProsperScore | -0.2446956 |
| CurrentDelinquencies | CreditRating | 0.2441267 |
| TradesNeverDelinquent..percentage. | BorrowerAPR | -0.2413489 |
| TradesNeverDelinquent..percentage. | AvailableBankcardCredit | 0.2384296 |
| CreditScoreRevision | DelinquenciesLast7Years | -0.2364492 |
| CreditScoreRevision | CurrentDelinquencies | -0.2345581 |
| DelinquenciesLast7Years | AmountDelinquent | 0.2332703 |
| LengthHistory | CreditScoreRange | 0.2260040 |
| CreditScoreRevision | TotalInquiries | -0.2253238 |
| TotalInquiries | ProsperScore | -0.2157662 |
| InquiriesLast6Months | CreditRating | 0.2155871 |
| IncomeRange | EmploymentFlag | 0.2123003 |
| LengthHistory | IsBorrowerHomeowner | 0.2007115 |
From the correlation figure, we can see that there’s very strong relationship between variable BorrowerAPR and CreditRating, this meets our expectation, the higher CreditRating is, the lower BorrowerAPR should be.
Moreover, there’s strong relationship between variable BorrowerAPR and ProsperScore while meanwhile CreditRating also has strong relationship with ProsperScore.
BorrowerAPR has moderate relationship with CreditScoreRevision while CreditRating also has strong relationship with CreditScoreRevision.
BorrowerAPR has weak relationship with BankcardUtilization, AvaliableBankcardCredit and TradesNeverDelinquent. Meanwhile, they have weak relationship between each other. Moreover, they have moderate relationship with CreditScoreRange.
InquiriesLast6Months has strong relationship with TotalInquiries, reasonable.
Intesting, IncomeVefiable has strong relationship with DebtToIncomeRatio, why?
TradesNeverDelinquent has moderate relationship with CurrentDelinquencies, DelinquenciesLast7Years, which makes sense.
lots of weak relationship.
The created variables’ value is not that obvious. CreditScoreRevision imporves a little compared CreditScoreRange. Will use CreditScoreRevision for the following analysis.
have not found the variables that are related to BorrowerAPR while not related to CreditRating
One idea bingo, want to know whether the orgination fee changes with CreditRating, so we will build one variable OrginationFee and visulize it.
We can see that the orgination fee also changes with the CreditRationg level. The obvious diff is between level ‘AA’ and level “A”.
The result meets our expectation as mentioned before, However, why there are lots of outliers and the variance is not small? it seems there are still other variables control the BorrowerAPR trendency, their influence can not be ignored.
Keep CreditRating fixed, chech the influence of ProsperScore.
As the correlation value calculated before, strong relationship between BorrowerAPR and ProsperScore.
Plot the relationship between ProsperScore and CreditRating. Confused, how to get the CreditRating? how to calculate the ProsperScore?
From the figure, we can see, part of ‘AA’ borrowers still have high risk score, e.g., 4. Therefore, we still should keep the ProsperScore feature, it can descirbe a different dimension for BorrowerAPR prediction.
The sample number in both tail side is not enough, e.g., [0 - 440]. The score is got from customer credit rating agency, still that quesion, how to get credit rating?
From the figure, we can see CreditRating ‘AA’ borrowers may have a low CreditScore, why? Whatever, CreditScoreRange is still one important feature for the prediction.
Try to reduce the levels for CreditScoreRange to check whether can imporve the correlation value by this way.
## (0,640] (640,680] (680,720] (720,760] (760,800] (800,840] (840,880]
## 26605 32858 28394 15873 7268 1976 239
## NA's
## 724
Emmm, more clear than CreditScoreRange.
If understand correctly, this BankcardUtilization should mean ratio of your credit card balances to credit limits. The higher the BankcardUtilization, the higher BorrowerAPR, because high BankcardUtilization will make lender to think that there’s an increased risk.
Bases on before correlation calculation output, we know AvailableBankcardCredit should have weak relationship with BorrowerAPR, this figure shows this. With present knowledge, credit limits = credit balance + credit pending transaction + avaliable credit. This feature can show the credit limits inderectly.
The higher radesNeverDelinquent..percentage., the lower BorrowerAPR.
There’s no clear trend in EmploymentStatus figure, however, we can see the BorrowerAPR of ‘Not employed’ is obviously larger than other classes. So, create a new variable EmploymentFlag to check more.
This should be clear now, Employed may get a low Borrower APR, but why the correlation value is low between BorrowerAPR and EmployedFlag? If we check more carefully, the tail below 25% seems long.
The Trend is not that obvious, just a little move upward. However, for CurrentDelinquencies, the diff betweet 0 and 1 is much bigger than the diff between 1 and 2, 2 and 3.
The trend is a little more obvious than CurrentDelinquencies and AmountDelinquent.
The trend is not that obvious.
The Trend is not that obvious, just a little move upward.
The Trend is not that obvious.
As what
0 not reasonable, samples are not enough. What’s the difference between ‘$0’ and ‘not employed’. The trend move a little downward.
The same with CurrentDelinquencies, the diff is much larger between 0 and 1. For this kind of variable, acturally, we can build a new variable, 0 and not 0, to increase the relationship, am i right?
50% DebtToincome is 10.1 when IncomeVerifiable is false, this is the max value for DebtToincome, what’s the meaning of 10.1? it seems that the strong relationship between DebtToIncomRatio and IncomeVerifiable is meanless and not useful for our objective.
investigation. How did the feature(s) of interest vary with other features in
the dataset?
There’s very strong relationship between variable BorrowerAPR and CreditRating, this meets our expectation, the higher CreditRating is, the lower BorrowerAPR should be.
Moreover, there’s strong relationship between variable BorrowerAPR and ProsperScore while meanwhile CreditRating also has strong relationship with ProsperScore. Moreover, from the bar plot, part of ‘AA’ borrowers still have high risk score, e.g., 4. Therefore, we still should keep the ProsperScore feature, it can descirbe a different dimension for BorrowerAPR prediction.
BorrowerAPR has moderate relationship with CreditScoreRange while CreditRating also has moderate relationship with CreditScoreRange. Moreover, we can see CreditRating ‘AA’ borrowers may have a low CreditScore, why? Whatever, CreditScoreRange is still one feature for the prediction.
BorrowerAPR has weak relationship with BankcardUtilization, AvaliableBankcardCredit and TradesNeverDelinquent. Meanwhile, they have weak relationship between each other. Moreover, they have moderate relationship with CreditScoreRange.
What confuses me is that, where we get the CreditRating, ProsperScore? All these features should combine the important credit information, like, payment history, credit utilization, length of creit history, new credit, credit mix etc, all of them are history feature.
(not the main feature(s) of interest)?
InquiriesLast6Months has strong relationship with TotalInquiries, reasonable. TradesNeverDelinquent has moderate relationship with CurrentDelinquencies, DelinquenciesLast7Years, which makes sense. All these prove that the history can predict the present status.
IncomeVerifiable has strong relationship with DebtToIncomeRatio, then find that when IncomeVeriiable is False, 50% DebtToIncomeRatio is max value 10.1, do not know the meaning of this value, this relathionship should be not useful for our objective.
BorrowerAPR and CreditRating
The trend is not that obvious, we can still see the small downward with Prosper. However, seems not for D and E.
We know there’s weak relationship between CreditRating and BankcardUtilization, the figure proves that, the most obvious one is that ‘AA’ borrowers tend to have small BankcardUtilization, ‘HR’ and ‘E’ borrowers tend to have big BankcardUtilization.
From the figure, we can see the weak relationship between TradesNeverDelinquent..percentage. and CreditRating, more ‘HR’ borrowers have small TradesNeverDelinquent..percentage. compared with ‘AA’ borrower. Moreover, we can see that if keep the creditRating fixed, smaller TradesNeverDelinquent..percentage tends to have bigger BorrowerAPR.
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?
BorrowerAPR has very strong relationship with CreditRating, strong relationship with ProsperScore, moderate relationship with CreditScoreRevision, weak relationship with BankcardUtilization, AvaliableBankcardCredit and TradesNeverDelinquent. Moreover, all these features have strong, moderate, weak relationship with CreditRating. However, if we dig deeper, can find each of them can describe a different dimension view of data. Where, wonder how to get the CreditRating value? calculated with history data, which includes payment history, credit utilization, length of creit history, new credit, credit mix etc? how to get the ProsperScore? how to get CreditScoreRange? what’s their difference?
lots of features have strong and moderate relationship with CreditRating. If all these features, i.e., CreditRating, ProsperScore, CreditScoreRevision, take account of history data, why they are different?, why we need them all?
strengths and limitations of your model.
Want to build a model to predict the BorrowerAPR, will do this in future, can use Machine Learning Algrithms.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00653 0.15629 0.20976 0.21883 0.28381 0.51229 25
From this figure, we can see several peaks, the most frequent one is around 0.2 and another bigger peak is around 0.36
The higher CreditRating is, the lower BorrowerAPR will be. However, the variance for each CreditRating level is not small, There are still some other imported features to control the finnal BorrowerAPR. From the trendency, it seems that one linear model can be built to predict the BorrowerAPR.
From the figure, we can see the weak relationship between TradesNeverDelinquent..percentage. and CreditRating, more ‘HR’ borrowers have small TradesNeverDelinquent..percentage. compared with ‘AA’ borrower. Moreover, we can see that if keep the creditRating fixed, smaller TradesNeverDelinquent..percentage tends to have bigger BorrowerAPR.
This is one loan data set, it takes me huge time to understand each variable, which includes understand the variables based on excel data description, search the related knowledge of credit loan and Prosper WebBank etc.
Then i try to explore the data by Univariate Plot and Bivariate Plot, begin to know what i can find based on the data. BorrowerAPR, yes, that’s what borrower and lender most care about. For borrowers, they want to know how can reduce the Borrower APR; For lenders, they want to know, what kind of loan will help them make lots of money and minimize their loss. I find lots of information in Bivariate Plot part. When the correlation matrix is calculated, i compare the value with the plot and finally totally understand what’s going on here.
In order to predict the BorrowerAPR better, i create some new variables, e.g, CreditRating, CreditScoreRange, CreditScoreRevision, EmploymentFlag, LengthHistory. However, the influence is not that obvious, just the relationship between BorrowerAPR and CreditScoreRevision improves a little compared with the relationship between BorrowerAPR and CreditScoreRange. Actually, i also find some variables, e.g., CurrentDelinquencies, PublicRecordsLast10Years etc, the difference bettween 0 and 1 is much larger, in future, will reduce their levels to 0 and not 0 to check more details.
BorrowerAPR has very strong relationship with CreditRating, strong relationship with ProsperScore, moderate relationship with CreditScoreRevision, weak relationship with BankcardUtilization, AvaliableBankcardCredit and TradesNeverDelinquent. Where, CreditRating has strong relationship with ProsperScore, moderate relationship with CreditScoreRevision. However, ProsperScore and CreditScoreRevision can describe the different view with CreditRating, therefore, they should be both features in this prediction model. Moreover, CreditScoreRevision has moderate relationship with BankcardUtilization, AvaliableBankcardCredit and TradesNeverDelinquent. Have not found the features that are not related to CreditRating but have relationship with BorrowerAPR, which makes sense, since all the score data (i.e., CreditRating, ProsperScore, CreditScoreRevision) has already taken account of all the history data.
I am still confused, how to get the CreditRating value? Calculated with history data, which includes payment history, credit utilization, length of creit history, new credit, credit mix etc? how to get the ProsperScore? Why so many score data in this data set?
Lots of questions have not been answers, need more clues to answer these questions. In other words, still not clear for part of data in the dataset. This part information can not be got by exploring data, should contact the data collectors for more details.
In the future, will build the prediction model to predict the BorrowerAPR. Will split the dataset to training data and testing data, build one model using Machine Learning Algrithms.